The data for this project come from the WeRateDogs Twitter archive in three pieces, each in a different form:
- The twitter_archive_enhanced.csv file will be manually downloaded from Udacity's course web page
- The image-predictions.tsv file will be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
- The tweet_json.txt file will be built on a local machine from tweet JSON data queried via the Twitter API

Data wrangling, which consists of:
- Gathering data
- Assessing data
- Cleaning data
Storing, analyzing, and visualizing your wrangled data
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.
Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.
Here it begins ...
import numpy as np
import pandas as pd
import os
import re
import time
import requests
import tweepy
import json
import matplotlib.pyplot as plt
%matplotlib inline
# Create folder for the data sets
folder_name = 'we_rate_dogs'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
# Read the twitter-archive-enhanced.csv file
folder_name = 'we_rate_dogs'
twitter_archive_enhanced = pd.read_csv(os.path.join(folder_name, 'twitter-archive-enhanced.csv'))
# URL for the image-prediction.tsv file
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions\
/image-predictions.tsv"
folder_name = 'we_rate_dogs'
response = requests.get(url)
# Write image-predictions file to folder
filename = os.path.join(folder_name, url.split('/')[-1])
with open(filename, mode='wb') as file:
    file.write(response.content)
folder_name = 'we_rate_dogs'
image_predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')
Since I have already gotten timestamp and image-url (jpg_url) information in the two files above, I don't need to store these information again in the below file obtained from tweet JSON data.
tweet_ids = twitter_archive_enhanced['tweet_id'].astype(str)
# Query data from the WeRateDog Twitter archive using the Tweepy library
# consumer_key = ************************
# consumer_secret = ***********************************
# access_token = ***************************************
# access_secret = ************************************
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify=True)
start = time.time()
data = {}
data['dog_rating'] = []
tweet_errors = []
for tweet_id in tweet_ids:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        data['dog_rating'].append({
            'tweet_id': int(tweet.id),
            #'created_at': tweet.created_at,
            #'img_url': tweet._json['entities']['media'][0]['media_url_https'],
            'retweet_count': tweet.retweet_count,
            'favorite_count': tweet.favorite_count
        })
    except Exception as e:
        print(tweet_id + " - " + str(e))
        tweet_errors.append(tweet_id)
end = time.time()
print('Elapsed time:', end - start)
Acknowledgement: The helpful definition for serializing the datetime object tweet.created_at that was not serializable by the default json code is obtained from a Stack Overflow solution.
from datetime import date, datetime
def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))
folder_name = 'we_rate_dogs'
file_name = os.path.join(folder_name, 'tweet_json.txt')
with open(file_name, 'w') as outfile:
    json.dump(data, outfile, default=json_serial)
folder_name = 'we_rate_dogs'
file_name = os.path.join(folder_name, 'tweet_json.txt')
with open(file_name) as json_file:
    data = json.load(json_file)
tweet_json = pd.DataFrame(data['dog_rating'],
columns=['tweet_id', 'retweet_count', 'favorite_count'])
twitter_archive_enhanced - data:
twitter_archive_enhanced.info()
Issue: timestamp and retweeted_status_timestamp are stored as strings instead of datetime
twitter_archive_enhanced.sample(10)
twitter_archive_enhanced.name.value_counts()
Issue: Some dog names don't seem to be valid dog names: they are either 'None' or start with lowercase letters.
# Unreal dog names that also start with lowercase letters or labeled as 'None'
twitter_archive_enhanced.name.str.extract('(^[a-z]+|None)', expand=True).loc[:,0].value_counts()
# I will call the list of these unreal-seeming names unreal_names
unreal_names = twitter_archive_enhanced.name.str.extract('(^[a-z]+|None)', expand=True).loc[:,0].value_counts().keys()
unreal_names
# Look at values in rating_denominator
twitter_archive_enhanced.rating_denominator.value_counts()
Note: Rating denominator almost always equals 10. Here we see rating denominators that are not equal to 10. Those that are multiples of 10 may be for ratings involving multiple dogs in an image. However, rating denominator values such as 11, 2, 16, 15, 7, and certainly 0 don't make sense.
# Store the above unexpected denominator values in unreal_denoms
unreal_denoms = [0, 2, 7, 11, 15, 16]
for idx in twitter_archive_enhanced[twitter_archive_enhanced.rating_denominator.isin(unreal_denoms)].index:
    print(str(idx) + " - ", twitter_archive_enhanced.loc[idx, 'text'])
    print('\n')
Issues: As we can see from the texts corresponding to the observations with rating_denominators: 0, 2, 7, 11, 15, and 16, ratings such as 960/00, 24/7, 9/11, 20/16, 1/2 are not real ratings at all.
ser = twitter_archive_enhanced.rating_numerator.value_counts()
ser
ser[:8].sum() / ser.sum()
Note: about 93.5% of the rating numerators range between 7 and 14. However, there are a few very large and very small values. The large numerator values could be combined ratings for several dogs in the same picture. I will check for outliers by dividing rating_numerator by rating_denominator after removing the observation with a zero rating_denominator.
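As a quick sketch of that outlier screen, here is the ratio check on a hypothetical mini-frame with the same column names as the archive (the cutoffs 0.7 and 1.4 are illustrative assumptions, not values fixed by the data):

```python
import pandas as pd

# Hypothetical mini-frame with the archive's column names.
df = pd.DataFrame({
    'rating_numerator': [13, 5, 84, 960],
    'rating_denominator': [10, 10, 70, 0],
})

# Remove the zero-denominator observation, then screen the ratio.
nonzero = df[df['rating_denominator'] != 0].copy()
nonzero['ratio'] = nonzero['rating_numerator'] / nonzero['rating_denominator']

# Illustrative cutoffs; note the multi-dog rating 84/70 survives as 1.2.
outliers = nonzero[(nonzero['ratio'] < 0.7) | (nonzero['ratio'] > 1.4)]
```

Dividing first, rather than judging numerators in isolation, keeps legitimate multi-dog ratings (e.g., 84/70) from being flagged.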
Issue: rating_numerator and rating_denominator together convey a single value (the dog's rating), so we don't need two separate columns for dog rating in the final cleaned version of our data file (keeping them separate, however, is very useful for cleanup purposes).
twitter_archive_enhanced[['doggo', 'puppo', 'pupper', 'floofer']].describe()
Note: according to THE DOGTIONARY:
- doggo: a big pupper, usually older
- pupper: a small doggo, usually younger
- puppo: a transitional phase between pupper and doggo
- floof: any dog really, though the label is commonly given to dogs with seemingly excess fur

According to THE DOGTIONARY, floofer does not seem to be a dog 'stage'.
Issue: the twitter_archive_enhanced data frame has three columns for one feature - dog 'stage' classification. This breaks the first of the three Tidy data requirements - that each variable forms a column.
image_predictions - data:
image_predictions.info()
image_predictions.img_num.value_counts()
image_predictions.p1.value_counts()
Issue: there is an inconsistency in the way dog breeds or prediction 'things' are spelled. Some are written in all lowercase letters and some are capitalized.
image_predictions.p1_dog.value_counts()
image_predictions[['p1_conf', 'p2_conf', 'p3_conf']].describe()
# What is in the p1_dog column?
image_predictions[image_predictions.p1_dog].p1.value_counts()
# Predictions that are not dog breeds
image_predictions[~image_predictions.p1_dog].p1.value_counts()
tweet_json - data:
tweet_json.info()
tweet_json[['retweet_count', 'favorite_count']].describe()
tweet_json.tail()
Issue: there are multiple duplicated rows
Issue 1: (Quality Issue) twitter_archive_enhanced data - timestamp and retweeted_status_timestamp are of data types strings instead of datetime
Issue 2: (Quality Issue) twitter_archive_enhanced data - Some dog names don't seem to be valid dog names: they are either 'None' or start with lowercase letters
Issue 3: (Quality Issue) twitter_archive_enhanced data - There are rating denominators that are not equal to 10. Those that are multiples of 10 may be for ratings involving multiple dogs in an image. However, rating denominator values such as 11, 2, 16, 15, 7, and certainly 0 don't make sense.
Issue 4: (Quality Issue) twitter_archive_enhanced data - About 93.5% of the rating numerators range between 7 and 14. However, there are a few very large and very small values. The large numerator values could be combined ratings for several dogs in the same picture. I will check for outliers by dividing rating_numerator by rating_denominator after removing the observation with a zero rating_denominator.
Issue 5: (Tidiness Issue) twitter_archive_enhanced data - rating_numerator and rating_denominator together convey a single value (the dog's rating), so we don't need two separate columns for dog rating in the final cleaned version of our data file (keeping them separate, however, is very useful for cleanup purposes).
Note: according to THE DOGTIONARY:
- doggo: a big pupper, usually older
- pupper: a small doggo, usually younger
- puppo: a transitional phase between pupper and doggo
- floof: any dog really, though the label is commonly given to dogs with seemingly excess fur

According to THE DOGTIONARY, floofer does not seem to be a dog 'stage'. This is why I thought melting `floofer` along with the other dog "stage" names was not the right course of action.
Issue 6: (Tidiness Issue) the twitter_archive_enhanced data frame has three columns for one feature - dog 'stage' classification. This breaks the first of the three Tidy data requirements - that each variable forms a column.
Issue 7: (Quality Issue) image_predictions data - there is an inconsistency in the way dog breeds or prediction 'things' are spelled. Some are written in all lowercase letters and some are capitalized.
Issue 8: (Tidiness Issue) tweet_json data - there are multiple duplicated rows
twitter_archive_enhanced_clean = twitter_archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
tweet_json_clean = tweet_json.copy()
unreal_names = twitter_archive_enhanced.name.str.extract('(^[a-z]+|None)', expand=True).loc[:,0].value_counts().keys()
indexes = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.name.isin(unreal_names)].index
# Replace all those unrealistic dog names by None
twitter_archive_enhanced_clean.loc[indexes, 'name'] = 'None'
twitter_archive_enhanced_clean.name.str.extract('(^[a-z])', expand=True).loc[:,0].value_counts()
indexes = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator.isin([0, 7, 15, 16])].index
twitter_archive_enhanced_clean.drop(labels=indexes, axis=0, inplace=True)
twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator.isin([0, 7, 15, 16])].shape
indexes = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating_numerator == 9) & (twitter_archive_enhanced_clean.rating_denominator == 11)].index
twitter_archive_enhanced_clean.loc[indexes, 'rating_numerator'] = 14
twitter_archive_enhanced_clean.loc[indexes, 'rating_denominator'] = 10
idx = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating_numerator == 7) & (twitter_archive_enhanced_clean.rating_denominator == 11)].index
twitter_archive_enhanced_clean.loc[idx, 'rating_numerator'] = 10
twitter_archive_enhanced_clean.loc[idx, 'rating_denominator'] = 10
idx = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating_numerator == 1) & (twitter_archive_enhanced_clean.rating_denominator == 2)].index
twitter_archive_enhanced_clean.loc[idx, 'rating_numerator'] = 9
twitter_archive_enhanced_clean.loc[idx, 'rating_denominator'] = 10
twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator.isin([2, 11])].shape
While the separate rating_numerator and rating_denominator columns served very important roles in our cleanup efforts above, they represent two columns for the same variable (i.e., dog rating), which violates one of the three Tidy data requirements.
Create a new column rating by dividing rating_numerator by rating_denominator, then drop these two columns
twitter_archive_enhanced_clean['rating'] = twitter_archive_enhanced_clean['rating_numerator'] / twitter_archive_enhanced_clean['rating_denominator']
twitter_archive_enhanced_clean.drop(labels=['rating_numerator', 'rating_denominator'], axis=1, inplace=True)
list(twitter_archive_enhanced_clean)
twitter_archive_enhanced_clean.rating.sample(6)
# Some stats of the variable rating = rating_numerator / rating_denominator
twitter_archive_enhanced_clean.rating.describe()
twitter_archive_enhanced_clean.rating.sort_values(ascending=False)
# Texts corresponding to very high rating values and rating values of 0
indexes = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating == 0) | (twitter_archive_enhanced_clean.rating > 1.5)].index
for idx in indexes:
    print(str(idx) + " - ", twitter_archive_enhanced_clean.loc[idx, 'text'])
    print('\n')
As we can see in the above texts, there appear to be some errors in parsing the dog ratings.
Also, the first of the two 420/10 ratings does not correspond to a dog; it refers to Snoop Dogg.
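Rather than patching each mis-parsed value by hand as is done below, one alternative is a regex that recovers decimal ratings such as 9.75/10 directly from the tweet text (a sketch; `parse_decimal_rating` is a hypothetical helper, not part of the archive):

```python
import re

# Ratings like "9.75/10" lose their integer part when only "\d+/\d+" is
# matched; this pattern captures the full decimal numerator instead.
DECIMAL_RATING = re.compile(r'(\d+\.\d+)\s*/\s*(\d+)')

def parse_decimal_rating(text):
    """Return numerator/denominator as a float, or None if no decimal rating."""
    m = DECIMAL_RATING.search(text)
    if m:
        return float(m.group(1)) / float(m.group(2))
    return None
```

Applying such a helper before computing the rating column would fix all decimal-numerator tweets at once instead of one index at a time.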
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating == 2.6].index
twitter_archive_enhanced_clean.loc[idx, 'rating'] = 1.126
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating == 2.7].index
twitter_archive_enhanced_clean.loc[idx, 'rating'] = 1.127
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating == 7.5].index
twitter_archive_enhanced_clean.loc[idx, 'rating'] = 0.975
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating==42.0].index[0]
twitter_archive_enhanced_clean.drop(labels=idx, axis=0, inplace=True)
twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating == 0) | (twitter_archive_enhanced_clean.rating > 1.5)].rating
Notice that the ratings we wanted to replace have been replaced; we leave the other outliers as they are.
image_predictions_clean[['p1', 'p2', 'p3']] = image_predictions_clean[['p1', 'p2', 'p3']].applymap(lambda x: x.title())
image_predictions_clean.p1.value_counts()
tweet_json - data
tweet_json_clean.info()
tweet_json_clean[['retweet_count', 'favorite_count']] = tweet_json_clean[['retweet_count', 'favorite_count']].fillna(0).astype(int)
tweet_json_clean.info()
tweet_json_clean.sample(5)
Duplicates?
sum(twitter_archive_enhanced_clean.duplicated()), sum(image_predictions_clean.duplicated()), sum(tweet_json_clean.duplicated())
tweet_json_clean = tweet_json_clean[~tweet_json_clean.tweet_id.duplicated()]
sum(tweet_json_clean.duplicated())
twitter_archive_enhanced_clean.tweet_id[:3]
image_predictions_clean.tweet_id[:3]
tweet_json_clean.tweet_id[:3]
# tweet_id in the twitter_archive_enhanced_clean data appears in descending order
twitter_archive_enhanced_clean.tweet_id.equals(twitter_archive_enhanced_clean.sort_values(by='tweet_id', ascending=False)['tweet_id'])
# tweet_id in the image_prediction_clean data appears in ascending order
image_predictions_clean.tweet_id.equals(image_predictions_clean.sort_values(by='tweet_id')['tweet_id'])
# tweet_id in the tweet_json_clean data appears in descending order
tweet_json_clean['tweet_id'].equals(tweet_json_clean.sort_values(by='tweet_id', ascending=False)['tweet_id'])
In the image_predictions data, tweet_id appears in increasing order; in the other two data sets it appears in descending order. For uniformity, I will arrange all three in increasing order.
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.sort_values(by='tweet_id')
tweet_json_clean = tweet_json_clean.sort_values(by='tweet_id')
ser = pd.Series(list(twitter_archive_enhanced_clean) + list(image_predictions_clean) + list(tweet_json_clean))
ser[ser.duplicated()]
twitter_archive_draft = pd.merge(tweet_json_clean, image_predictions_clean, on='tweet_id')
twitter_archive_draft.shape
twitter_archive_draft = pd.merge(twitter_archive_draft, twitter_archive_enhanced_clean, on='tweet_id')
twitter_archive_draft.shape
twitter_archive_draft.sample(5)
twitter_archive_draft has three columns for one feature - dog 'stage' classification. This violates the first of the three Tidy data requirements - that each variable forms a column
Melt the doggo, puppo, and pupper columns into dog_stage_name and dog_stage columns. Then drop the dog_stage_name column: because the values in the three columns doggo, puppo, and pupper are the stage names themselves, the dog_stage_name column will not carry any unique information.
twitter_archive_draft.columns
twitter_archive_draft = pd.melt(twitter_archive_draft,
id_vars=['tweet_id','retweet_count','favorite_count','jpg_url','img_num',
'p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf',
'p3_dog','in_reply_to_status_id','in_reply_to_user_id','timestamp',
'source','text','retweeted_status_id','retweeted_status_user_id',
'retweeted_status_timestamp','expanded_urls','name','floofer','rating'],
value_vars=['doggo','pupper','puppo'],
var_name='dog_stage_name',
value_name='dog_stage')
twitter_archive_draft = twitter_archive_draft.drop('dog_stage_name', axis=1)
twitter_archive_draft.columns
sum(twitter_archive_draft.duplicated())
There are 3806 duplicated observations, and I will drop these duplicates
twitter_archive_draft = twitter_archive_draft.drop_duplicates()
sum(twitter_archive_draft.duplicated())
twitter_archive_draft.tail()
twitter_archive_draft[twitter_archive_draft.dog_stage=='doggo'].shape
twitter_archive_draft.info()
timestamp, retweeted_status_timestamp, floofer, and dog_stage have improper data types:
- timestamp and retweeted_status_timestamp to datetime
- floofer to boolean
- dog_stage to categorical
# Replace 'None' by np.nan
twitter_archive_draft = twitter_archive_draft.replace('None', np.nan)
# timestamp and retweet_status_timestamp to datetime
twitter_archive_draft.timestamp = pd.to_datetime(twitter_archive_draft.timestamp)
twitter_archive_draft.retweeted_status_timestamp = pd.to_datetime(twitter_archive_draft.retweeted_status_timestamp)
# 'floofer' to bool
twitter_archive_draft['floofer'] = twitter_archive_draft['floofer'].map(lambda x: 1 if pd.notnull(x) else 0).astype('bool')
twitter_archive_draft['floofer'].value_counts()
# dog_stage as category
twitter_archive_draft.dog_stage = twitter_archive_draft.dog_stage.astype('category')
twitter_archive_draft.info()
Store the `twitter_archive_draft` data frame in `twitter_archive_master.csv` CSV file
folder_name = 'we_rate_dogs'
filename = os.path.join(folder_name, 'twitter_archive_master.csv')
twitter_archive_draft.to_csv(filename, index=False)
Connect to a database
from sqlalchemy import create_engine
Create SQLAlchemy Engine and empty we_rate_dogs database
engine = create_engine('sqlite:///we_rate_dogs.db')
Store twitter_archive_draft DataFrame in a table called twitter_archive_master in we_rate_dogs.db
twitter_archive_draft.to_sql('twitter_archive_master', engine, if_exists='replace', index=False, chunksize=499)
The code above raised this error: OperationalError: too many SQL variables
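SQLite historically caps the number of bound parameters per statement at 999, so a safer chunksize can be derived from the column count rather than guessed. A sketch under that assumption (the 27-column frame is hypothetical, standing in for the merged archive):

```python
import sqlite3
import pandas as pd

# Hypothetical frame with 27 columns, like the merged archive.
df = pd.DataFrame({'col%d' % i: range(5) for i in range(27)})

MAX_VARS = 999                                  # SQLite's historical default limit
safe_chunksize = MAX_VARS // len(df.columns)    # rows per INSERT: 999 // 27 == 37

conn = sqlite3.connect(':memory:')
df.to_sql('demo', conn, if_exists='replace', index=False, chunksize=safe_chunksize)
rows = conn.execute('SELECT COUNT(*) FROM demo').fetchone()[0]
conn.close()
```

With chunksize=499 and ~27 columns, each multi-row INSERT would need far more than 999 variables, which is consistent with the error above.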
# Using sqlite3
import sqlite3
# Create the connection
conn = sqlite3.connect('we_rate_dogs.db')
# Convert the twitter_archive_draft DataFrame to a twitter_archive_master table in the database
twitter_archive_draft.to_sql("twitter_archive_master", conn, if_exists="replace", index=False)
conn = sqlite3.connect('we_rate_dogs.db')
df = pd.read_sql_query('SELECT * FROM twitter_archive_master limit 3;', conn)
conn.close()
df
twitter_archive_draft.groupby(by="dog_stage")['rating'].mean() * 10
folder = './we_rate_dogs/image/'
filename = os.path.join(folder, 'dog_stages.png')
(twitter_archive_draft.groupby(by="dog_stage")['rating'].mean() * 10).plot(kind='bar', color="#CCCC99")
plt.ylabel('Average rating (out of 10)')
plt.xlabel('Dog Stage')
plt.grid(b=True, which='major', axis='both', lw=0.25)
plt.xticks(rotation=45, ha="right", color='gray')
plt.savefig(filename, bbox_inches='tight', pad_inches=0.5);
twitter_archive_draft.name.value_counts()[:10]
twitter_archive_draft.timestamp.min(), twitter_archive_draft.timestamp.max()
The timestamps for tweets in this data set range from Nov 15, 2015 to Aug 1, 2017. To compare the number of tweets by month, I will restrict the data to 2016 only.
from datetime import datetime as dt
mask = (twitter_archive_draft.timestamp >= '2016-01-01') & (twitter_archive_draft.timestamp < '2017-01-01')
df = twitter_archive_draft.loc[mask].copy()
df['month'] = df.timestamp.dt.month
df = df[['timestamp', 'month']]
df['month'] = df['month'].apply(lambda x: dt(2016, x, 1).strftime('%b'))
folder = './we_rate_dogs/image/'
filename = os.path.join(folder, 'monthly_tweets.png')
plt.figure(figsize=(10,8))
df.month.value_counts().plot(kind='bar', width=0.75, color="#CCCC99")
plt.ylabel('Number of Tweets')
plt.xlabel('Month (in 2016)')
plt.grid(b=True, which='major', axis='both', lw=0.25)
plt.xticks(rotation=45, ha='right', color="gray")
plt.savefig(filename, bbox_inches='tight', pad_inches=0.2);
import requests
from PIL import Image
from io import BytesIO
top_idxs = twitter_archive_draft.sort_values(by='favorite_count', ascending=False).index[:10]
top_six_idxs = twitter_archive_draft.loc[top_idxs][~pd.Series(twitter_archive_draft.loc[top_idxs, 'jpg_url']).duplicated()].index
r = requests.get(twitter_archive_draft.loc[top_six_idxs[0], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[0], 'text'].split('https')[0])
i
r = requests.get(twitter_archive_draft.loc[top_six_idxs[1], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[1], 'text'].split('https')[0])
i
r = requests.get(twitter_archive_draft.loc[top_six_idxs[2], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[2], 'text'].split('https')[0])
i
r = requests.get(twitter_archive_draft.loc[top_six_idxs[3], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[3], 'text'].split('https')[0])
i
r = requests.get(twitter_archive_draft.loc[top_six_idxs[4], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[4], 'text'].split('https')[0])
i
r = requests.get(twitter_archive_draft.loc[top_six_idxs[5], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[5], 'text'].split('https')[0])
i